Front End Architecture


[Diagram labels: Serial Computer; CPU; Memory; Bus Adaptor; Memory Bus (VME, BI, LBus)]


User programs run on front end 

Macrocode calls are made to the sequencer 
Front end and CM sequencer are closely coupled 
Performance for array transfer: 


• VAX 8800 <-> sequencer   1 MByte per sec
• LISPM <-> sequencer      <3 MBytes per sec
• SUN 4 <-> sequencer      4 MBytes per sec


CM 


Thinking Machines Corporation 




Processor Section 


[Diagram: one processor section. Recoverable labels: CM Chip (16 bit-serial processors and router), shown twice; router wires; I/O; RAM (32 sets, 64K x 1 bits, 16 + 6); 8 Data chips; 3 ECC chips; Floating Point Unit; enable; status; sequencer address]

Section is repeated 2,048 times in a full CM-2.


Instructions are broadcast 
Memory addresses can be broadcast or locally generated (indirect addressing) 


The memory and floating point units are commercial parts 
CM and Sprint chips are standard cell/custom CMOS parts 
• 68 pin CMOS
• 8 MHz
• Flat memory hierarchy

• Bit serial <-> word parallel conversion
• Floating point interface
• Indirect addressing
Floating Point Unit

[Diagram labels: Sprint Chip; Floating Point Unit; Instruction from sequencer; Register File, 32- or 64-bit x 32]

Batching 32 processors gives vector performance
Virtual Processor pipelining

Floating point add or multiply takes one "add time"
Floating point divide takes 3 "add times"
Calling PARIS instructions

• Any field in the current VP set can be passed to a PARIS instruction
  as a field operand. For example, if A and B are fields that are at least
  32 bits long and allocated in the current VP set:

      CM_s_add_2_1L(a, b, 32)

  the values in each virtual processor get added together according
  to the context flag

• Fields can have offsets

      CM_add_offset_to_field_id(field_id, offset)

  returns an offset field ID
Calling PARIS instructions
(continued)

• Offset field IDs can be passed to any PARIS instruction
  as a regular field. For example, suppose DESTINATION and
  SOURCE are 64 bit fields used to hold complex numbers. If the
  first 32 bits are the real part and the second 32 bits are the
  imaginary part, a complex add library routine could be written
  as follows:

      subroutine complex_add(destination, source)
      integer destination, source
      call CM_f_add_2_1L(destination, source, 23, 8)
      call CM_f_add_2_1L(CM_add_offset_to_field_id(destination, 32),
     $     CM_add_offset_to_field_id(source, 32), 23, 8)
      return
      end
Virtual Processor Scheme

Example: 16K CM-2 initial-dimensions '(256 256)
(VP Ratio 4:1)

LINE PVAR << PIXEL PVAR
(e.g. 9,000 Lines from 65,000 Pixels)

[Diagram: physical processors P0, P1, ... each holding virtual processors VP0 to VP3, with a legend marking virtual processors as active or inactive. Labels: PIXEL PVAR, 65,000 data elements (VP 4:1); LINE PVAR, 9,000 data elements.]
New Virtual Processors

• Solve bigger problems

• Solve current problems more efficiently

• Overall system performance gain

• Better memory utilization
Overview

• Why do we need VPs?

• What is a VP?

• How does a VP work?

• Why do we need new VPs?

• What are new VPs?

• How do we use the new VP scheme?

• How do new VPs work?

• How are new VPs back compatible with the current VP scheme?

• What is the performance impact of using new VPs?

• What are possible future paths for increasing performance for new VPs?
Why do we need VPs?

• Most data intensive problems have more data items than the
  number of physical processors in the Connection Machine.
  Virtual processors are a tool for configuring the Connection
  Machine to have as many processors as is necessary for
  solving a user's problem.

• Thinking Machines delivers Connection Machines of different
  sizes. Virtual processors are needed in order to run a
  fixed size problem on different size Connection Machines.
What is a Virtual Processor?

• Like a small simple CPU and associated memory

• Contains variables and flags

• One flag is used as the context flag (turns the virtual processor
  on or off for a particular PARIS instruction)

• Contains a stack (one global stack across all of the virtual processors)
[Diagram: memory of one physical processor. Labels: Physical Scratch Memory; Virtual Flags (4 bits), shown once per virtual processor; Context Flag; Carry Flag; Overflow Flag; Test Flag]
How does a VP work?

For each virtual processor:

• Load up physical flags from virtual flags

• Perform operation on virtual address
  (physical base plus virtual address)

• Save virtual flags from physical flags

For a VP ratio of N:

• Each VP gets ~64K/N bits of memory

• The instruction takes N times longer to operate
  (except for certain operations which are
  pipelined through the VP cycle)
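
The load, operate, save cycle above can be modelled in a few lines of Python. This is only an illustrative sketch for a single physical processor, not the sequencer's actual microcode: the flat-list memory, the placement of the virtual flags at virtual address 0, and the names `execute_on_all_vps`, `bits_per_vp` and `increment` are all assumptions made for the example.

```python
MEMORY_BITS = 64 * 1024   # bits of memory per physical processor (from the slide)
FLAGS_ADDRESS = 0         # hypothetical: virtual flags sit at virtual address 0


def bits_per_vp(vp_ratio, memory_bits=MEMORY_BITS):
    """Upper bound on memory per virtual processor: roughly 64K/N bits."""
    return memory_bits // vp_ratio


def execute_on_all_vps(memory, vp_ratio, virtual_address, operation):
    """Run one instruction on every VP of a single physical processor.

    memory is a flat list split into vp_ratio equal regions, one per VP.
    Returns the number of passes made, which equals vp_ratio.
    """
    region = len(memory) // vp_ratio
    passes = 0
    for vp in range(vp_ratio):
        physical_base = vp * region
        # Load the physical flags from this VP's virtual flags.
        flags = dict(memory[physical_base + FLAGS_ADDRESS])
        # Physical base plus virtual address.
        address = physical_base + virtual_address
        # The context flag turns the VP on or off for this instruction.
        if flags["context"]:
            memory[address], flags = operation(memory[address], flags)
        # Save the virtual flags from the physical flags.
        memory[physical_base + FLAGS_ADDRESS] = flags
        passes += 1
    return passes


def increment(value, flags):
    """Example operation: add one; the test flag records a positive result."""
    result = value + 1
    return result, dict(flags, test=result > 0)
```

The loop body runs once per virtual processor, which is why an unpipelined instruction takes N times longer at a VP ratio of N.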
Why do we need new VPs?

• Most problems have varying size data sets.

  The current VP scheme may waste both memory and time;
  consequently, problem size may be limited by the limitations
  of the virtual processor scheme.

• Examples of problems that have varying size data sets:

  Image Analysis
    Number of pixels in an image
    Number of line segments in an image
    Number of vertices of objects in an image
    Number of objects in an image

  VLSI
    Number of transistors
    Number of gate equivalents
    Number of wires

  Database
    Number of characters in the database
    Number of words in the database
    Number of records in the database
What are new VPs?

• The new VP scheme is a way to vary data size for different
  parts of algorithms

• Like the current scheme in that each VP contains variables and flags

• One flag is the context flag

• Variables (fields) can be allocated on the stack or out of the heap
How do new VPs work?

[Diagram: processor memory plotted against processor number (VPR = 1, up to 64K). Labels: Physical Scratch Memory; Field ID; Heap; VP Set #2]
VP Sets

• A set of VP fields, some of which are used
  as flags

• Has a particular VP ratio

• Most PARIS instructions only operate on fields that
  are in the current VP set
Conceptual overview of the new VP scheme

• There are sets of virtual processors

• Only one VP set is active at any moment

• Different VP sets can have different numbers of processors

• Data can be communicated between fields in different VP sets

• CM memory is not a contiguous chunk of bits
  (in the new VP scheme it is broken up into
  arbitrary length fields)
Conceptual overview of the new VP scheme
(continued)

• There is one global stack (per physical processor)

• There is one global heap (per physical processor)

• Old programs will continue to work (back compatibility mode)

• Time of execution is proportional to VP ratio

• Memory usage is proportional to VP ratio
How to use the new VP scheme

• Creating VP sets

• Selecting VP sets

• Allocating and deallocating fields

• Declaring flags

• Calling PARIS instructions

• Safety
Selecting VP sets

• There is only one current VP set

• Each VP set has its own virtual flags

• CM:set-vp-set VP-set
  - makes VP-set current
  - makes its flags the current virtual flags
Allocating and Deallocating Memory

• The basic unit of Connection Machine memory is a field

• A field is a chunk of contiguous bits across the current number
  of processors

• A field that is N bits long takes up N * VPR bits of memory
  (VPR is the virtual processor ratio)

• There are two kinds of fields:
  - Stack fields
  - Heap fields
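
As a worked illustration of the N * VPR rule, the sketch below applies it to the pixel and line example (65,000 pixels and 9,000 lines on a 16K machine). The helper names are hypothetical, and the assumption that a VP ratio is rounded up to a power of two is mine, not something the slides state.

```python
def vp_ratio_for(elements, physical_processors):
    """Smallest power-of-two VP ratio that gives one VP per data element.

    Assumption: VP ratios are powers of two.
    """
    ratio = 1
    while ratio * physical_processors < elements:
        ratio *= 2
    return ratio


def field_memory_bits(length_bits, vp_ratio):
    """Memory used on each physical processor by a field N bits long: N * VPR."""
    return length_bits * vp_ratio


PHYSICAL_PROCESSORS = 16 * 1024                                 # a 16K CM-2

pixel_ratio = vp_ratio_for(65_000, PHYSICAL_PROCESSORS)         # pixel VP set
line_ratio = vp_ratio_for(9_000, PHYSICAL_PROCESSORS)           # line VP set

# A 32-bit field per line costs less memory in its own VP set than it would
# if the lines were forced to share the pixel VP set's ratio.
line_field_in_own_set = field_memory_bits(32, line_ratio)
line_field_in_pixel_set = field_memory_bits(32, pixel_ratio)
```

Under these assumptions the same 32-bit line field occupies a quarter of the memory when it lives in a VP set sized for the lines.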
Stack Fields

Stack fields get allocated by a first allocated,
last deallocated scheme

Fields allocated on the stack are contiguous

CM:allocate-stack-field length
    allocates a LENGTH bit long stack field in the current VP set

CM:deallocate-stack-fields N
    deallocates the last N allocated stack fields

CM:deallocate-upto-stack-field Field-ID
    deallocates all stack fields (including Field-ID)
    allocated since Field-ID was allocated

CM:deallocate-upto-stack-field-e Field-ID
    deallocates all stack fields (excluding Field-ID)
    allocated since Field-ID was allocated
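
The stack discipline described above can be sketched as a small Python model. It is not the PARIS implementation: the class `FieldStack`, its integer field IDs, and the `inclusive` keyword (standing in for the two deallocate-upto variants) are assumptions for illustration, and the model tracks per-VP bit lengths only, ignoring the VP ratio.

```python
class FieldStack:
    """Toy model of stack fields: contiguous, first allocated is last deallocated."""

    def __init__(self):
        self.fields = []    # (field_id, base, length), oldest first
        self.top = 0        # next free bit address
        self.next_id = 1    # hypothetical field ID counter

    def allocate_stack_field(self, length):
        """Allocate a field `length` bits long directly above the previous one."""
        field_id = self.next_id
        self.next_id += 1
        self.fields.append((field_id, self.top, length))
        self.top += length
        return field_id

    def deallocate_stack_fields(self, n):
        """Deallocate the last n allocated stack fields."""
        for _ in range(n):
            _, base, _ = self.fields.pop()
            self.top = base

    def deallocate_upto_stack_field(self, field_id, inclusive=True):
        """Deallocate every field allocated since field_id.

        inclusive=True also frees field_id itself; inclusive=False keeps it.
        """
        ids = [entry[0] for entry in self.fields]
        index = ids.index(field_id)
        keep = index if inclusive else index + 1
        self.deallocate_stack_fields(len(self.fields) - keep)
```

Because fields are contiguous and freed in reverse order, the stack never fragments, which is the main contrast with heap fields.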



Heap Fields

• There is a pool of memory growing down from the high end of
  processor memory called the heap.

CM:allocate-heap-field length
    allocates a LENGTH bit long heap field in the current VP set

CM:deallocate-heap-field Field-ID
    deallocates heap field Field-ID

• Fragmentation may occur from allocating and
  deallocating different size fields
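
To make the fragmentation remark concrete, here is a minimal Python model of a heap that grows down from the high end of memory. The allocation policy (highest free block that fits), the class `FieldHeap`, and the use of a field's base address as its ID are all assumptions for illustration; the slides do not describe the real allocator.

```python
class FieldHeap:
    """Toy heap: fields are carved from the high end of a fixed pool of bits."""

    def __init__(self, memory_bits):
        self.free = [(0, memory_bits)]   # free blocks as (base, length), sorted by base
        self.allocated = {}              # field base -> length

    def allocate_heap_field(self, length):
        """Take `length` bits from the top of the highest free block that fits."""
        for index in range(len(self.free) - 1, -1, -1):
            base, size = self.free[index]
            if size >= length:
                field_base = base + size - length
                if size == length:
                    del self.free[index]
                else:
                    self.free[index] = (base, size - length)
                self.allocated[field_base] = length
                return field_base
        raise MemoryError("no free block of %d bits" % length)

    def deallocate_heap_field(self, field_base):
        """Return a field to the free list, merging adjacent free blocks."""
        length = self.allocated.pop(field_base)
        merged = []
        for base, size in sorted(self.free + [(field_base, length)]):
            if merged and merged[-1][0] + merged[-1][1] == base:
                merged[-1] = (merged[-1][0], merged[-1][1] + size)
            else:
                merged.append((base, size))
        self.free = merged

    def free_bits(self):
        """Total free bits, which may be split across several blocks."""
        return sum(size for _, size in self.free)
```

After allocating three 30-bit fields from a 100-bit pool and freeing the first and third, 70 bits are free but no single block holds 50, so a 50-bit request fails even though enough memory is free in total.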
Flags

• There are four possible flags used by PARIS

      CM:CONTEXT-FLAG
      CM:OVERFLOW-FLAG
      CM:CARRY-FLAG
      CM:TEST-FLAG

• Any one-bit field in the current VP set can be used as the context,
  overflow, carry or test flag.

• Each VP set has its own flags

• Changing the current VP set makes that VP set's flags current
Calling PARIS instructions

• Any field in the current VP set can be passed to a PARIS instruction
  as a field operand. For example, if A and B are fields that are at least
  32 bits long and allocated in the current VP set:

      (CM:s-add-2-1L A B 32)

  the values in each virtual processor get added together according
  to the context flag

• Fields can have offsets

      (CM:add-offset-to-field-id Field-ID Offset)

  returns an offset field ID
Calling PARIS instructions
(continued)

• Offset field IDs can be passed to any PARIS instruction
  as a regular field. For example, suppose DESTINATION and
  SOURCE are 64 bit fields used to hold complex numbers. If the
  first 32 bits are the real part and the second 32 bits are the
  imaginary part, a complex add library routine could be written
  as follows:

      (defun complex-add (destination source)
        (CM:f-add-2-1L destination source)
        (CM:f-add-2-1L (CM:add-offset-to-field-id destination 32)
                       (CM:add-offset-to-field-id source 32)))
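
The effect of the complex add routine can be imitated in Python to show what the two offset calls select. This is a sketch under simplifying assumptions: each VP's memory is a dictionary keyed by bit address, a field ID is just a bit address, and the functions only mirror the PARIS names rather than reproduce their behavior.

```python
def add_offset_to_field_id(field_id, offset):
    """Return a field ID that starts `offset` bits into the given field."""
    return field_id + offset


def f_add_2_1L(vps, destination, source):
    """In every VP whose context flag is set, add source into destination."""
    for vp in vps:
        if vp["context"]:
            vp["memory"][destination] += vp["memory"][source]


def complex_add(vps, destination, source):
    """Add 64-bit complex fields: real part at offset 0, imaginary part at offset 32."""
    f_add_2_1L(vps, destination, source)
    f_add_2_1L(vps,
               add_offset_to_field_id(destination, 32),
               add_offset_to_field_id(source, 32))


def make_vp(context, dest_value, source_value, dest=0, source=64):
    """Build one toy VP holding a complex DESTINATION and SOURCE field."""
    return {
        "context": context,
        "memory": {
            dest: dest_value.real, dest + 32: dest_value.imag,
            source: source_value.real, source + 32: source_value.imag,
        },
    }
```

Calling `complex_add(vps, 0, 64)` on a list of such VPs updates the real and imaginary halves independently and leaves VPs with the context flag off untouched.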
Additional Features

• Resizing fields

• Safety
  - run time safety switch
  - trades off error checking against execution speed
  - does not affect correct programs
  - some of the error checks are:
    - makes sure that fields are in the current VP set
    - makes sure that fields are at least as long as the passed length;
      for example, CM:LOGNOT(A,32)
      makes sure that field A is at least 32 bits long
    - makes sure that field IDs correspond to legal (allocated) fields
Communication

• Within a VP set all communication works
  - router
  - spreads
  - scans
  - NEWS

• Between VP sets only the router works, for example:
  CM:send-1L(Dest, Send-Addr, Source, Length, Notify-bit)

  Each VP sends from SOURCE to field DEST in the VP pointed to by SEND-ADDR.
  The send address length varies depending on the VP set of DEST.
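
A rough Python picture of a router send between two VP sets follows. It is a sketch, not the router: VP sets are lists of dictionaries, the context flag of the sending VP is assumed to gate the send, the notify bit is assumed to be set in each receiving VP, and the model assumes no two senders target the same destination. The function names only echo the slide.

```python
def send_address_length(destination_vp_count):
    """Bits needed to address any VP in the destination VP set."""
    return max(1, (destination_vp_count - 1).bit_length())


def send_1L(dest_set, dest, source_set, send_addr, source, notify=None):
    """Each active VP in source_set writes its `source` field into the `dest`
    field of the VP in dest_set that its `send_addr` field points to.

    Returns the number of messages delivered.
    """
    delivered = 0
    for vp in source_set:
        if not vp["context"]:
            continue                      # inactive VPs do not send
        target = dest_set[vp[send_addr]]
        target[dest] = vp[source]
        if notify is not None:
            target[notify] = 1            # mark the VP as having received a message
        delivered += 1
    return delivered


# A small "line" VP set sending into a larger "pixel" VP set.
lines = [
    {"context": True, "addr": 5, "value": 11},
    {"context": False, "addr": 2, "value": 22},
    {"context": True, "addr": 0, "value": 33},
]
pixels = [{"inbox": 0, "got": 0} for _ in range(8)]
delivered = send_1L(pixels, "inbox", lines, "addr", "value", notify="got")
```

The address length depends only on the size of the destination VP set, which is why the same sender needs a longer send address when the destination set has more virtual processors.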
(CM:send-1L dest send-addr source length notify)

[Diagram: memory before and after the send, plotted against processor number (VPR = 8). Labels: send-addr; source; context; Stack; AFTER Send]
Back Compatibility

PARIS programs written for previous releases of CM system software will continue to work.
Here's why:

Current field IDs look like:

| 16 bits (FDA) | 16 bits (offset) |

FDA (field descriptor address) is a pointer into a table that lists things like:
physical base location, VP set, VP increment.

To determine the base physical location for a field:
LOCATION(FDA(Field-ID)) + OFFSET(Field-ID)

In the CM-2, memory addresses fit within the 16-bit offset, so if you look at
a memory address as a field ID, the FDA is always zero.

Setting up field zero at cold boot time in the new VP scheme therefore gives the correct behavior for
back compatibility.
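
The field ID layout and lookup rule above translate directly into bit operations. The sketch below is illustrative only: the descriptor table is a plain dictionary with made-up entries, and the function names are hypothetical.

```python
OFFSET_BITS = 16
OFFSET_MASK = (1 << OFFSET_BITS) - 1


def make_field_id(fda, offset=0):
    """Pack a 16-bit FDA and a 16-bit offset into one field ID."""
    return (fda << OFFSET_BITS) | offset


def split_field_id(field_id):
    """Return (FDA, offset) for a field ID."""
    return field_id >> OFFSET_BITS, field_id & OFFSET_MASK


def physical_base(field_id, descriptor_table):
    """LOCATION(FDA(field_id)) + OFFSET(field_id)."""
    fda, offset = split_field_id(field_id)
    return descriptor_table[fda]["location"] + offset


# Hypothetical descriptor table. Entry 0 is "field zero", set up at cold boot
# with base location 0, so an old-style raw memory address decodes to itself.
EXAMPLE_TABLE = {
    0: {"location": 0},
    3: {"location": 4096},
}

old_style_address = 1234                   # FDA bits are all zero
new_style_field = make_field_id(3, 32)     # field 3, offset by 32 bits
```

With this table, `physical_base(old_style_address, EXAMPLE_TABLE)` returns the address unchanged, which is the back compatibility argument in miniature.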
Back Compatibility
(continued)

• Programs written in PARIS that use any of the new release 5
  features must use fields.

• Usually this just means replacing pairs of CM:push-space and
  CM:pop-and-discard with CM:allocate-stack-field
  and CM:deallocate-stack-fields.

• Programs written exclusively in higher level languages
  (*Lisp or C*) should not have to be modified in any way.
Performance

Every PARIS instruction must decode its fields.

• Decoding happens on the front end
  (performance depends on the front end)

• Decoding happens once per field per PARIS instruction

• Decoding happens on the front end in parallel with work being done
  on the Connection Machine (the Connection Machine operates
  on the current instruction while the front end is decoding the
  next instruction's fields)

• For small VP ratios and small fields there might be a small performance
  decrease on certain front ends. Since release 5 allows for efficient
  manipulation of complex data structures, overall system performance
  should be significantly higher for people using different size data sets.
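
A simple timing model shows why overlapped decoding only costs time when the Connection Machine's work per instruction is short. The model and its numbers are assumptions for illustration, in arbitrary time units; they are not measurements from any front end.

```python
def pipelined_time(decode_times, cm_times):
    """Total time when the front end decodes instruction i+1 while the
    Connection Machine executes instruction i."""
    total = decode_times[0]                      # the first decode cannot be hidden
    for i, cm in enumerate(cm_times):
        next_decode = decode_times[i + 1] if i + 1 < len(decode_times) else 0
        total += max(cm, next_decode)            # the slower of the two sets the pace
    return total


def serial_time(decode_times, cm_times):
    """Total time with no overlap between decoding and execution."""
    return sum(decode_times) + sum(cm_times)


decode = [2, 2, 2]          # hypothetical decode cost per instruction

large_work = [10, 10, 10]   # large VP ratio or long fields: CM time dominates
small_work = [1, 1, 1]      # small VP ratio and short fields: decode dominates

hidden = pipelined_time(decode, large_work)
exposed = pipelined_time(decode, small_work)
```

When CM execution dominates, all decoding after the first instruction is hidden; when it does not, the front end sets the pace, which matches the caveat about small VP ratios and small fields.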
Conclusion

Most problems use fairly complex data structures with varying
size data sets. The new virtual processor scheme was developed
to make operating on such data sets as efficient as possible.

We believe that the new VP scheme is the software foundation for
making the CM-2 the fastest computer in the world for data-intensive
problems.